dropout training
On Convergence and Generalization of Dropout Training
We study dropout in two-layer neural networks with rectified linear unit (ReLU) activations. Under mild overparametrization and assuming that the limiting kernel can separate the data distribution with a positive margin, we show that dropout training with logistic loss achieves $\epsilon$-suboptimality in test error in $O(1/\epsilon)$ iterations.
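The setting in this abstract can be illustrated with a minimal sketch of a two-layer ReLU network whose hidden units are dropped at random and scored with the logistic loss. The abstract does not specify the details, so the function names, the 1/(1 - drop_prob) rescaling of surviving units, and the label convention y in {-1, +1} are all assumptions made for illustration. The helper `expected_output` enumerates every mask to show that, under this rescaling, the dropout output matches the clean output in expectation.

```python
import itertools
import math


def hidden_layer(W, x):
    """ReLU activations of the hidden layer; W is a list of weight rows."""
    return [max(sum(w * xi for w, xi in zip(row, x)), 0.0) for row in W]


def forward(W, a, x, mask, drop_prob):
    """Network output with hidden units zeroed by `mask` and survivors rescaled.

    The 1 / (1 - drop_prob) rescaling is an assumed (common) convention.
    """
    scale = 1.0 / (1.0 - drop_prob)
    h = hidden_layer(W, x)
    return sum(aj * mj * hj * scale for aj, mj, hj in zip(a, mask, h))


def logistic_loss(margin):
    """log(1 + exp(-margin)), computed stably; margin = y * f(x), y in {-1, +1}."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def expected_output(W, a, x, drop_prob):
    """Average of forward() over all 2^m masks, weighted by mask probability."""
    m = len(a)
    total = 0.0
    for mask in itertools.product((0, 1), repeat=m):
        kept = sum(mask)
        prob = (1 - drop_prob) ** kept * drop_prob ** (m - kept)
        total += prob * forward(W, a, x, mask, drop_prob)
    return total
```

Enumerating masks is only feasible for a handful of hidden units; in practice one mask would be sampled per training step.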
Dropout Training as Adaptive Regularization
Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an $L_2$ regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learner, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer.
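The regularization view can be checked numerically on a small logistic regression. The sketch below assumes dropout keeps each feature with probability 1 - delta and rescales survivors by 1/(1 - delta), so the noisy features are unbiased; this convention and all function names are assumptions, not taken from the abstract. Under that assumption, a second-order Taylor expansion of the log-partition function A(z) = log(1 + e^z) gives the penalty (1/2) * delta/(1 - delta) * sum_i p_i (1 - p_i) * sum_j x_ij^2 beta_j^2, a weighted $L_2$ penalty whose per-feature weights sum_i p_i (1 - p_i) x_ij^2 form a diagonal Fisher information estimate. The exact penalty is computed by enumerating all masks for comparison.

```python
import itertools
import math


def log_partition(z):
    """A(z) = log(1 + exp(z)) for logistic regression, computed stably."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def exact_dropout_penalty(X, beta, delta):
    """sum_i E[A(x~_i . beta)] - A(x_i . beta), enumerating all keep/drop masks."""
    d = len(beta)
    total = 0.0
    for x in X:
        clean = log_partition(sum(xj * bj for xj, bj in zip(x, beta)))
        expected = 0.0
        for mask in itertools.product((0, 1), repeat=d):
            kept = sum(mask)
            prob = (1 - delta) ** kept * delta ** (d - kept)
            z = sum(m * xj * bj for m, xj, bj in zip(mask, x, beta)) / (1 - delta)
            expected += prob * log_partition(z)
        total += expected - clean
    return total


def quadratic_dropout_penalty(X, beta, delta):
    """Second-order approximation: a Fisher-weighted L2 penalty on beta."""
    total = 0.0
    for x in X:
        p = sigmoid(sum(xj * bj for xj, bj in zip(x, beta)))
        curvature = p * (1.0 - p)  # A''(x_i . beta)
        total += curvature * sum((xj * bj) ** 2 for xj, bj in zip(x, beta))
    return 0.5 * delta / (1.0 - delta) * total
```

For small coefficients the two functions should agree closely, and both vanish when delta is 0. The enumeration is exponential in the number of features, so it is only meant as a sanity check on toy inputs.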
Information Geometry of Dropout Training
Masanari Kimura, Hideitsu Hino
Deep neural networks have been experimentally successful in a variety of fields (Deng and Yu, 2014; LeCun et al., 2015; Goodfellow et al., 2016). Dropout is one of the techniques that contribute to the performance improvement of neural networks (Srivastava et al., 2014). Many experimental studies have reported the effectiveness of dropout, making it an important technique for training neural networks (Wu and Gu, 2015; Pham et al., 2014; Park and Kwak, 2016; Labach et al., 2019). Furthermore, the simplicity of the idea has led to a great number of proposed variants (Iosifidis et al., 2015; Moon et al., 2015; Gal et al., 2017; Zolna et al., 2017; Hou and Wang, 2019; Keshari et al., 2019; Ma et al., 2020). Understanding the behavior of such an important technique can help determine which of these variants to use, and in which cases dropout is effective in the first place.
Machine Learning's Dropout Training is Distributionally Robust Optimal
Jose Blanchet, Yang Kang, Jose Luis Montiel Olea, Viet Anh Nguyen, Xuhui Zhang
This paper shows that dropout training in Generalized Linear Models is the minimax solution of a two-player, zero-sum game in which an adversarial nature corrupts a statistician's covariates using a multiplicative nonparametric errors-in-variables model. In this game, known as a Distributionally Robust Optimization problem, nature's least favorable distribution is dropout noise, where nature independently deletes entries of the covariate vector with some fixed probability $\delta$. Our decision-theoretic analysis shows that dropout training, the statistician's minimax strategy in the game, indeed provides out-of-sample expected loss guarantees for distributions that arise from multiplicative perturbations of in-sample data. This paper also provides a novel, parallelizable, Unbiased Multi-Level Monte Carlo algorithm to speed up the implementation of dropout training. Our algorithm has a much smaller computational cost than the naive implementation of dropout, provided the number of data points is much smaller than the dimension of the covariate vector.
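The dropout noise described in this abstract, where each covariate entry is deleted independently with probability $\delta$, can be sketched for logistic regression as follows. This is a plain stochastic-gradient version of the naive approach, not the multi-level Monte Carlo algorithm the paper proposes. The function names, the 1/(1 - delta) rescaling of surviving entries, the labels in {0, 1}, and the hyperparameters are assumptions made for illustration.

```python
import math
import random


def dropout_covariates(x, delta, rng):
    """Delete each entry independently with probability delta.

    Survivors are rescaled by 1 / (1 - delta) so the noisy vector is
    unbiased for x; this rescaling is an assumed convention.
    """
    scale = 1.0 / (1.0 - delta)
    return [xj * scale if rng.random() >= delta else 0.0 for xj in x]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_dropout_logistic(X, y, delta, lr=0.1, epochs=200, seed=0):
    """SGD on the logistic loss, drawing fresh dropout noise for every example."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            x_noisy = dropout_covariates(xi, delta, rng)
            p = sigmoid(sum(bj * xj for bj, xj in zip(beta, x_noisy)))
            # Gradient of log(1 + e^z) - y * z with respect to beta is (p - y) * x.
            beta = [bj - lr * (p - yi) * xj for bj, xj in zip(beta, x_noisy)]
    return beta
```

Each step uses a single noisy copy of the example, so the gradient is a one-sample Monte Carlo estimate of the gradient of the expected dropout loss. Averaging over several noisy copies per step would reduce the variance at proportionally higher cost.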